Introduction

Background

This project was done as a part of “Data Analysis” course as part of our teams studies in Digital Sciences for High-Tech in the University of Tel-Aviv. Our team has great interest in using our studies for exploring and in the future maybe even developing tools for improving the way we treat our planet. Therefore our subject is CO2 emissions in congestion with the world happiness index.

Goals

  1. Find the trends of CO2 emissions in different countries along the years 1700-2017.
  2. Find what countries had greater growth in emissions and in which years.
  3. Compare the leading countries with the highest recent trends of growth or decreased emission in correlation with the current and previous UN happiness Index.

Part 1: Importing Data and Packages

Data import

happy_index_2005 <- "Data/world-happiness-report-2005-2020.csv" 
happy_index_2021 <- "Data/world-happiness-report-2021.csv"
emission <- "Data/co2_emission.csv"
population <-"Data/population.csv"

df_2005 <- read.csv(happy_index_2005)
df_2021 <- read.csv(happy_index_2021)
df_emiss <- read.csv(emission, col.names = c("Entity", "Code" , "Year", "co2"))
df_pop <- read.csv(population, col.names = c("Entity", "Code" , "Year", "Population"))

The first Data consists of the UN happiness index for the years 2005-2017.

as_tibble(df_2005)

The second data is the happiness index report for 2021.

as_tibble(df_2021)

The third data contains CO2 emissions from around 1700 up until 2017, by countries and continents.

as_tibble(df_emiss)

The final data is a a list of population sizes by countries for about the same years as the CO2 data.

as_tibble(df_pop)

As seen in the table above, the main dataset regarding emissions has only 4 features, of which 2 are identical (countries and country code). We will now examine the summary of it’s characteristics:

summary(df_emiss)
    Entity              Code                Year           co2            
 Length:20853       Length:20853       Min.   :1751   Min.   :-6.255e+08  
 Class :character   Class :character   1st Qu.:1932   1st Qu.: 3.188e+05  
 Mode  :character   Mode  :character   Median :1971   Median : 3.829e+06  
                                       Mean   :1953   Mean   : 1.931e+08  
                                       3rd Qu.:1995   3rd Qu.: 3.707e+07  
                                       Max.   :2017   Max.   : 3.615e+10  

As seen above, we can see the minimal year, minimal amount of CO2 emissions and same for mean and max values. We also learn about the size of the data, around ~21k rows.

Part 2: Tidying the Data

Tidying Functions

We have just created the two main datasets to work with, which includes important features from all of the previous datasets and some features that we have created above:

  1. “df_all” which include all data per country per year from 1950 til 2017, introducing some new features:
  • Population, taken from the the population’s DF
  • CO2/Population ratio, calculated.
  • “Diff” column, which shows the difference between the previous and the current year emissions for each consecutive year.
  1. “df_merge”, which includes data for each country regarding:
  • The average ratio feature, which is calculated from 1950 to 2017 using the previously explained “ratio” feature.
  • The difference in co2 emission this period and the life-ladder in 2017.
  • The average population in those years.
  • The average CO2 emissions.
  • The 2017’s “Life.Ladder” - happiness index score.
  • “Sum.Dif” - summing over the diff feature.

The summary for df_merge:

summary(df_merge)
    Entity            avg.ratio        avg.Population         avg.co2         
 Length:138         Min.   : 0.03123   Min.   :2.900e+05   Min.   :2.539e+05  
 Class :character   1st Qu.: 0.67361   1st Qu.:4.486e+06   1st Qu.:4.375e+06  
 Mode  :character   Median : 3.04396   Median :9.762e+06   Median :2.017e+07  
                    Mean   : 4.87430   Mean   :4.360e+07   Mean   :1.902e+08  
                    3rd Qu.: 7.70521   3rd Qu.:2.873e+07   3rd Qu.:9.600e+07  
                    Max.   :25.67181   Max.   :1.296e+09   Max.   :5.558e+09  
  Life.Ladder       sum.dif          
 Min.   :2.662   Min.   :-675355128  
 1st Qu.:4.593   1st Qu.:    142380  
 Median :5.587   Median :   6399728  
 Mean   :5.463   Mean   : 102976353  
 3rd Qu.:6.262   3rd Qu.:  28576960  
 Max.   :7.788   Max.   :7786512016  

And for df_all:

summary(df_all)
    Entity              Code                Year           co2            
 Length:17854       Length:17854       Min.   :1950   Min.   :-6.255e+08  
 Class :character   Class :character   1st Qu.:1967   1st Qu.: 4.746e+05  
 Mode  :character   Mode  :character   Median :1984   Median : 4.430e+06  
                                       Mean   :1984   Mean   : 2.423e+08  
                                       3rd Qu.:2002   3rd Qu.: 4.593e+07  
                                       Max.   :2019   Max.   : 3.615e+10  
                                                      NA's   :3595        
      Diff              Population            ratio        
 Min.   :-600971748   Min.   :1.000e+03   Min.   :  0.000  
 1st Qu.:     -9810   1st Qu.:2.400e+05   1st Qu.:  0.408  
 Median :     43968   Median :3.386e+06   Median :  1.841  
 Mean   :   5310082   Mean   :6.108e+07   Mean   :  5.211  
 3rd Qu.:   1018269   3rd Qu.:1.230e+07   3rd Qu.:  6.384  
 Max.   :1543508239   Max.   :7.713e+09   Max.   :403.351  
 NA's   :3595         NA's   :914         NA's   :4509     

skim(df_merge)
-- Data Summary ------------------------
                           Values  
Name                       df_merge
Number of rows             138     
Number of columns          6       
_______________________            
Column type frequency:             
  character                1       
  numeric                  5       
________________________           
Group variables            None    

-- Variable type: character -----------------------------------------------------------
# A tibble: 1 x 8
  skim_variable n_missing complete_rate   min   max empty n_unique whitespace
* <chr>             <int>         <dbl> <int> <int> <int>    <int>      <int>
1 Entity                0             1     4    24     0      138          0

-- Variable type: numeric -------------------------------------------------------------
# A tibble: 5 x 11
  skim_variable  n_missing complete_rate         mean           sd       p0         p25
* <chr>              <int>         <dbl>        <dbl>        <dbl>    <dbl>       <dbl>
1 avg.ratio              0             1         4.87         5.71  3.12e-2       0.674
2 avg.Population         0             1  43601578.   146848485.    2.90e+5 4485573.   
3 avg.co2                0             1 190209421.   689048147.    2.54e+5 4375076.   
4 Life.Ladder            0             1         5.46         1.16  2.66e+0       4.59 
5 sum.dif                0             1 102976353.   694705139.   -6.75e+8  142380.   
          p50         p75          p100 hist 
*       <dbl>       <dbl>         <dbl> <chr>
1        3.04        7.71         25.7  ▇▃▁▁▁
2  9761806.   28733782.   1295924678.   ▇▁▁▁▁
3 20167978.   95996205.   5558492798.   ▇▁▁▁▁
4        5.59        6.26          7.79 ▂▆▇▇▅
5  6399728.   28576960.   7786512016    ▇▁▁▁▁

Part 3: Proccesing of the Data

Visualisation

Countries of the World

In this plot the different countries in the data are shown in deeper colors for higher values of emission to population ratio in the year of 2017

3.1 Countries Emission vs Populatio by Years and continents

The following plot shows for the years 2000-2017 the amount of CO2 emitted from each country vs it’s population. The ratio is shown as well as the size of each dot. For comfort of the view, each country is allocated to it’s fitting continent and a color.



d <- df_all %>%
  filter(Year > 2000) %>%
  filter(Entity != 'World')

d_cont<- data.frame(country = c(d$Entity))

d_cont$continent <- countrycode(sourcevar = d_cont[, 'country'],
                            origin = "country.name",
                            destination = "continent")
Some values were not matched unambiguously: Africa, Americas (other), Antarctic Fisheries, Asia, Asia and Pacific (other), Channel Islands, EU-28, Europe, Europe (other), International transport, Latin America, Micronesia (country), Middle East, North America, Oceania, Saint Barthlemy, Statistical differences, Timor
d$continent <- d_cont$continent

d <- na.omit(d)

fig <- d %>%
  plot_ly(
    x = ~co2, 
    y = ~Population, 
    size = ~ratio,
    frame = ~Year, 
    text = ~Entity, 
    color = ~continent,
    type = 'scatter',
    mode = 'markers'
  )
fig <- fig %>% layout(
      title = "Yearly CO2 Emissions by Countries vs Population",
      xaxis = list(
      type = "log"
    )
  )

fig
`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.

3.2 Commulative Emissions by Countries Over the Years 1950-2017

# d <- df_all[ , c("Entity", "Year", "Population")]
d <- df_all %>%
  filter(Entity != "World") %>%
  filter(co2 != 0)
d
pp <- streamgraph(d, key="Entity",
                  order = "asis",
                  value="co2", 
                  date="Year",
                  offset="zero",
                  sort="co2"
                  ) %>%
  sg_axis_y(tick_format = "e")  %>%
  sg_legend(show=TRUE, label="Country: ") %>%
  sg_title("CO2 Emissions by Years and Countries, 1950-2017")

pp
streamgraph_html returned an object of class `list` instead of a `shiny.tag`.streamgraph_html returned an object of class `list` instead of a `shiny.tag`.
CO2 Emissions by Years and Countries, 1950-2017

3.4 Yearly in CO2 Emissions for Top 10 Polluting Countries

top_emiss_2017 <- df_emiss %>%
  filter(Year == 2017) %>%
  filter(Code != '') %>%
  filter(Entity != "World") %>%
  top_n( 10, co2) %>%
  arrange(desc(co2)) %>%
  head(10)
  
top_countries = top_emiss_2017$Entity
g <- ggplot(data = df_emiss[df_emiss$Entity %in% top_countries, ] 
                          , aes(x = Year, y = co2, group = Entity)) + 
    geom_line() +
    labs(title = "Top 10 Countries by CO2 Emissions") + 
    geom_line(aes(col = Entity)) +
    theme_ipsum()
g <- ggplotly(g)
g
g <- ggplot(data = df_all[df_all$Entity %in% top_countries, ] 
                          , aes(x = Year, y = ratio, group = Entity)) + 
    geom_line(aes(col = Entity)) + 
    labs(title = "Top 10 Countries by CO2 Emissions to population ratio") +
    theme_ipsum()

g <- ggplotly(g)
g
NA

Boxplot for top 10 Countries

# TODO add boxplot for top 10 diff by year
d <- df_all[df_all$Entity %in% top_countries, ] 
d
fig <- plot_ly(d,
               x = ~Entity,
               y = ~Diff,
               type = "box",
               color = ~Year)

fig <- fig %>%  layout(
  title = "Difference in Emission for Top 10 Polluting Countries",
  boxmode = "group")
fig
Ignoring 31 observationsline.color doesn't (yet) support data arraysOnly one fillcolor per trace allowedline.color doesn't (yet) support data arraysOnly one fillcolor per trace allowed'layout' objects don't have these attributes: 'boxmode'
Valid attributes include:
'font', 'title', 'uniformtext', 'autosize', 'width', 'height', 'margin', 'computed', 'paper_bgcolor', 'plot_bgcolor', 'separators', 'hidesources', 'showlegend', 'colorway', 'datarevision', 'uirevision', 'editrevision', 'selectionrevision', 'template', 'modebar', 'newshape', 'activeshape', 'meta', 'transition', '_deprecated', 'clickmode', 'dragmode', 'hovermode', 'hoverdistance', 'spikedistance', 'hoverlabel', 'selectdirection', 'grid', 'calendar', 'xaxis', 'yaxis', 'ternary', 'scene', 'geo', 'mapbox', 'polar', 'radialaxis', 'angularaxis', 'direction', 'orientation', 'editType', 'legend', 'annotations', 'shapes', 'images', 'updatemenus', 'sliders', 'colorscale', 'coloraxis', 'metasrc', 'barmode', 'bargap', 'mapType'
Ignoring 31 observationsline.color doesn't (yet) support data arraysOnly one fillcolor per trace allowedline.color doesn't (yet) support data arraysOnly one fillcolor per trace allowed'layout' objects don't have these attributes: 'boxmode'
Valid attributes include:
'font', 'title', 'uniformtext', 'autosize', 'width', 'height', 'margin', 'computed', 'paper_bgcolor', 'plot_bgcolor', 'separators', 'hidesources', 'showlegend', 'colorway', 'datarevision', 'uirevision', 'editrevision', 'selectionrevision', 'template', 'modebar', 'newshape', 'activeshape', 'meta', 'transition', '_deprecated', 'clickmode', 'dragmode', 'hovermode', 'hoverdistance', 'spikedistance', 'hoverlabel', 'selectdirection', 'grid', 'calendar', 'xaxis', 'yaxis', 'ternary', 'scene', 'geo', 'mapbox', 'polar', 'radialaxis', 'angularaxis', 'direction', 'orientation', 'editType', 'legend', 'annotations', 'shapes', 'images', 'updatemenus', 'sliders', 'colorscale', 'coloraxis', 'metasrc', 'barmode', 'bargap', 'mapType'
NA

Top 10 Happiest countries Emmissions by Years


top_10_happiness <- df_merge %>%
  filter(rank(desc(Life.Ladder))<=10)


g <- ggplot(data = df_emiss[df_emiss$Entity %in% top_10_happiness$Entity, ] 
                          , aes(x = Year, y = co2, group = Entity)) + 
    geom_line() +
    labs(title = "Top 10 happiness Countries by CO2 Emissions") + 
    theme_ipsum() +
    geom_line(aes(col = Entity))
g <- ggplotly(g)
g

Modelling

In this section we are going to suggest two types of correlation based on our exploration of the data:

  1. There is correlation between the total trend of emissions and the current happiness index.
  2. There is correlation between current emission normalized by the population to the current happiness index.
g <- ggplot(data = df_merge, aes(x = avg.ratio, y = Life.Ladder)) + 
    geom_point(aes(colour = Entity)) + 
    labs(title = "Average Ratio vs Happiness Index") +
    theme_ipsum() 
g <- ggplotly(g)
g

In the following plot we have removed china and India because they are sort of “outliers” in a sense that they have displayed much larger grow in emissions compared to other countries, therefore making it very hard to displyon the same plot.

df <- subset(df_merge, Entity != "China" & Entity != "India" )

g <- ggplot(data = df, aes(x = Life.Ladder , y = sum.dif)) + 
    geom_point(aes(colour = Entity)) + 
    labs(title = "Sum of Emissions Difference vs 2017 Happiness Index (China, India ex.)") +
    theme_ipsum()
g <- ggplotly(g)
g

# Part 4: Modeling

We are now going to introduce the linear model:

linearmod  <- lm(Life.Ladder ~ avg.ratio, data=df_merge)

summary(linearmod)

Call:
lm(formula = Life.Ladder ~ avg.ratio, data = df_merge)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2447 -0.6262 -0.0959  0.6765  2.1682 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.88716    0.10612  46.055  < 2e-16 ***
avg.ratio    0.11818    0.01416   8.346 7.09e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9468 on 136 degrees of freedom
Multiple R-squared:  0.3387,    Adjusted R-squared:  0.3338 
F-statistic: 69.65 on 1 and 136 DF,  p-value: 7.093e-14
confint(linearmod, level=0.95)
                 2.5 %    97.5 %
(Intercept) 4.67731147 5.0970157
avg.ratio   0.09017982 0.1461869
plot(fitted(linearmod),residuals(linearmod))



g <- ggplot(df_merge, aes(x = avg.ratio, y = Life.Ladder)) +
  geom_point(aes(colour = Entity)) + 
  theme_bw() +
  stat_smooth(method = "lm") + 
  labs(title = "Linear Modelling Happiness Index and Average Ratio")
g <- ggplotly(g)
`geom_smooth()` using formula 'y ~ x'
g
NA
NA

numeric_tidy <- df_merge[-1]

corr_data <- cor(numeric_tidy)

g <- ggcorrplot(corr_data, hc.order = TRUE, type = "lower",
   outline.col = "white") +
  labs(title = "Correlation Matrix") +
  theme_ipsum()
   # colors = c("darkolivegreen2", "white", "darkolivegreen"))
g <- ggplotly(g)
g
NA
# TODO The smooth function doesn't work!!!!

gmod  <- lm(Life.Ladder ~ log(avg.ratio), data=df_merge)

summary(gmod)
#plot x vs. y
g <- ggplot( df_merge, aes(x=log(avg.ratio), y=Life.Ladder)) +
    geom_point(aes(colour = Life.Ladder)) +
    theme_ipsum() +
    labs(title = "Another model") +
    geom_smooth(method = "glm",  formula = y~log(x),
    method.args = list(family = "binomial"), 
    se = FALSE)
g <- ggplotly(g)
g

Add text what we need to understand and explain

Conclusions: present

#multiple regression
multi <- lm(Life.Ladder ~ avg.ratio + sum.dif + avg.co2 , data=df_merge)
summary(multi)

Now let’s say we want to compare the differences between the top 10 happiness countries avg.ratio is higher then other countries. In this case, we have two unrelated (i.e., independent or unpaired) groups of samples. Therefore, it’s possible to use an independent t-test to evaluate whether the means are different.

mA - Weighted average of the 10 most happiest countries. mB - Weighted average of the rest of the countries.

Our research questions:

Is the mean of top 10 happiness (mA) is greater than the mean of other countries (mB)? // Yarden - I fixed the sentence

Is the weighted average of the 10 happiest countries (mA) greater than the weighted average of the other countries (mB)?

H0:mA≥mB - The null hypothesis

Ha:mA<mB (less) - The alternative hypothesis

# Create a data frame
T_data <- df_merge %>%
  select(Entity, avg.ratio,) %>%
  mutate(group = ifelse(Entity %in% top_10_happiness$Entity, "Top 10 Contries", "Other Countries")) 
  

## Yarden - Need to add to T_data a column of Life.Ladder

## TODO make a graph that shows the difference of Life Ladder and avg.ratio by groups "top_10" and "others". 

g <- ggboxplot(T_data, x = "group", y = "avg.ratio", 
          color = "group", palette = c("#00AFBB", "#E7B800"),
        ylab = "avg. ratio", xlab = "Group") +
  theme_ipsum() +
  labs(title = "Average Ratio Compared - Top 10 Countries vs Other Countries")

g <- ggplotly(g)
g
NA
NA
#Do the two group have the same variances?
#We’ll use F-test to test for homogeneity in variances. 

res.ftest <- var.test(avg.ratio ~ group, data = T_data)
res.ftest

    F test to compare two variances

data:  avg.ratio by group
F = 3.8362, num df = 127, denom df = 9, p-value = 0.03252
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 1.132105 8.499782
sample estimates:
ratio of variances 
          3.836239 

The p-value of the F-test is p = 0.03252. It is lower than the significance level alpha = 0.05. In conclusion, there is a significant difference between the variances of the two data sets. Therefore, we used the T-test and assumed that the variances are not equal - according to case 1 of hypothesis testing.


# Compute t-test
res <- t.test(avg.ratio ~ group, data = T_data, var.equal = FALSE, alternative = "less")
res

    Welch Two Sample t-test

data:  avg.ratio by group
t = -5.1609, df = 15.107, p-value = 5.678e-05
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -3.574853
sample estimates:
mean in group Other Countries mean in group Top 10 Contries 
                     4.482087                      9.894609 
---
title: "Global Trends of CO2 Emission and Effects on the Global Hapinnes Index"
author: "Guy Krothammer, Yarden Katash, Guy Dahan"
subtitle: Data Analysis Project - Digital Sciences for Hightech
output:
  html_document:
    toc: yes
    df_print: paged
  html_notebook:
    toc: yes
    number_section: no
    # fig_width: 20
    # fig_height: 15
    fig_caption: yes
    df_print: kable
    fontsize: 11pt
    theme: united
    code_folding: hide
editor_options:
  chunk_output_type: inline
---

# Introduction

## Background

This project was done as a part of "Data Analysis" course as part of our teams 
studies in Digital Sciences for High-Tech in the University of Tel-Aviv. 
Our team has great interest in using our studies for exploring and in the future maybe even
developing tools for improving the way we treat our planet. Therefore our subject is
CO2 emissions in congestion with the world happiness index. 

## Goals

1. Find the trends of CO2 emissions in different countries along the years 1700-2017.
2. Find what countries had greater growth in emissions and in which years.
3. Compare the leading countries with the highest recent trends of growth or decreased emission
in correlation with the current and previous UN happiness Index. 

# Part 1: Importing Data and Packages

```{r, include=FALSE}
# install and import packages

install.packages(c("cowplot", "ggalt", "ggalt", "GGally", "Hmisc",
                   "ISLR", "skimr", "ggcorrplot", "ggpubr", "DiagrammeR",
                   "xts", "countrycode", "countrycode", "shiny"))

remotes::install_github("hrbrmstr/streamgraph")
remotes::install_github("martinctc/rwa") # r
install.packages("waffle", repos = "https://cinc.rud.is") #waffle plot
install.packages("ggchicklet", repos = "https://cinc.rud.is") # for rounded corners 
```

```{r, include=FALSE}
# import library
library(tidyverse)
library(ggchicklet)
library(cowplot)
library(hrbrthemes)
library(ggalt)
library(GGally)
library(rwa)
library(readr)
library(Hmisc)
library(tibble)
library(dplyr)
library(ISLR)
library(skimr)
library(ggcorrplot)
library(ggpubr)
library(xts)
library(countrycode)
library(plotly)
library(shiny)
library(streamgraph)
```

## Data import

```{r, results='hide'}
happy_index_2005 <- "Data/world-happiness-report-2005-2020.csv" 
happy_index_2021 <- "Data/world-happiness-report-2021.csv"
emission <- "Data/co2_emission.csv"
population <-"Data/population.csv"

df_2005 <- read.csv(happy_index_2005)
df_2021 <- read.csv(happy_index_2021)
df_emiss <- read.csv(emission, col.names = c("Entity", "Code" , "Year", "co2"))
df_pop <- read.csv(population, col.names = c("Entity", "Code" , "Year", "Population"))
```

The first Data consists of the UN happiness index for the years 2005-2017.

```{r, collapse=TRUE}
as_tibble(df_2005)
```
The second data is the happiness index report for 2021.

```{r, collapse=TRUE}
as_tibble(df_2021)
```
The third data contains CO2 emissions from around 1700 up until 2017, by countries and continents.

```{r, collapse=TRUE}
as_tibble(df_emiss)
```
The final data is a a list of population sizes by countries for about the same years as the CO2 data.

```{r, collapse=TRUE}
as_tibble(df_pop)
```

As seen in the table above, the main dataset regarding emissions has only 4 features, of which 2 are identical (countries and country code). We will now examine the summary of it's characteristics:

```{r}
summary(df_emiss)
```
As seen above, we can see the minimal year, minimal amount of CO2 emissions and same for mean and max values. 
We also learn about the size of the data, around ~21k rows. 

# Part 2: Tidying the Data

## Tidying Functions

```{r, echo=FALSE}

# add "dif" column

df_emiss <-df_emiss %>% 
  group_by(Entity) %>%
  mutate(Diff = co2 - lag(co2, default = co2[1]))

#filter year up 1950

df_pop_to_merge <- df_pop %>%
  filter(Year > 1949)

df_emiss_to_merge <- df_emiss %>%
  filter(Year > 1949)

#arrange data to merge emission and population 

df_emiss_to_merge$Entity <- as.character(df_emiss_to_merge$Entity)
df_pop_to_merge$Entity <- as.character(df_pop_to_merge$Entity)

df_all <- full_join(df_emiss_to_merge, df_pop_to_merge, by = c('Entity','Year'))

drops <- c("Code.y")
df_all <-df_all[, !(names(df_all) %in% drops)]

names(df_all)[names(df_all) == "Code.x"] <- "Code"

#add column of normalized emission: co2/population by year (avg.ratio)

df_all <- df_all %>%
  mutate(ratio = co2/Population)

### Create dataframe for both 30 years average and by countries

names(df_2005)
names(df_2005)[names(df_2005) == "ï..Country.name"] <- "Entity"

df_2017 <- df_2005 %>%
  filter(year == 2017)%>%
  select(c(Life.Ladder, Entity))

df_last_30 <- df_all %>%
  filter(Year >= 1987) %>%
  na.omit()

agg1 <- aggregate(df_last_30[,c('ratio', 'Population', 'co2')], list(df_last_30$Entity), mean) %>%
  set_names(nm=c('Entity', 'avg.ratio', 'avg.Population', 'avg.co2'))

agg2 <- aggregate(df_last_30[,c('Diff')], list(df_last_30$Entity), sum ) %>%
  set_names(nm=c('Entity', 'sum.dif'))
agg2

df_2017$Entity <- as.character(df_2017$Entity)
agg1$Entity <- as.character(agg1$Entity)
agg2$Entity <- as.character(agg2$Entity)

df_merge <- full_join(agg1, df_2017, by = c('Entity')) %>%
  full_join(agg2) %>%
  na.omit() # TODO check why 20 countries were removed

df_merge
```

We have just created the two main datasets to work with, which includes important features from all of the previous datasets and some features that we have created above:

1. "df_all" which include all data per country per year from 1950 til 2017, introducing some new features:
  * Population, taken from the the population's DF
  * CO2/Population ratio, calculated.
  * "Diff" column, which shows the difference between the previous and the current year emissions for each consecutive year.
  
1. "df_merge", which includes data for each country regarding:
  * The average ratio feature, which is calculated from 1950 to 2017 using the previously explained "ratio" feature. 
  * The difference in co2 emission this period and the life-ladder in 2017.
  * The average population in those years. 
  * The average CO2 emissions.
  * The 2017's "Life.Ladder" - happiness index score. 
  * "Sum.Dif" - summing over the diff feature. 

The summary for df_merge:

```{r, collapse=TRUE}
summary(df_merge)
```
And for df_all:
```{r, collapse=TRUE}
summary(df_all)
```

```{r, collapse=TRUE}
skim(df_merge) # TODO Mayber remove this ? seems to much
```

# Part 3: Proccesing of the Data

## Visualisation 

### Countries of the World
In this plot the different countries in the data are shown in deeper colors for higher values of emission to population ratio in the year of 2017
```{r}
d = df_all %>%
  filter(Year==2017)
d

l <- list(color = toRGB("grey"), width = 0.2)

# specify map projection/options
g <- list(
  showframe = FALSE,
  showcoastlines = FALSE,
  projection = list(type = 'Mercator')
)

p <- plot_geo(d) %>%
  add_trace(
    z = ~ratio, color = ~ratio, colors = 'Reds',
    text = ~Entity, locations = ~Code, marker = list(line = l)
  ) %>%
  colorbar(title = 'CO2/Population', thickness=15) %>%
  layout(
    title = 'World Ratio of CO2/Population',
    geo = g
  )

p<- ggplotly(p)
p
```


### 3.1 Countries Emission vs Populatio by Years and continents 

The following plot shows for the years 2000-2017 the amount of CO2 emitted from each country vs it's population. The ratio is shown as well as the size of each dot.
For comfort  of the view, each country is allocated to it's fitting continent and a color. 

```{r}


d <- df_all %>%
  filter(Year > 2000) %>%
  filter(Entity != 'World')

d_cont<- data.frame(country = c(d$Entity))

d_cont$continent <- countrycode(sourcevar = d_cont[, 'country'],
                            origin = "country.name",
                            destination = "continent")
d$continent <- d_cont$continent

d <- na.omit(d)

fig <- d %>%
  plot_ly(
    x = ~co2, 
    y = ~Population, 
    size = ~ratio,
    frame = ~Year, 
    text = ~Entity, 
    color = ~continent,
    type = 'scatter',
    mode = 'markers'
  )

fig <- fig %>% layout(
      title = "Yearly CO2 Emissions by Countries vs Population",
      xaxis = list(
      type = "log"
    )
  )

fig
```

### 3.2 Commulative Emissions by Countries Over the Years 1950-2017

```{r}
# d <- df_all[ , c("Entity", "Year", "Population")]
d <- df_all %>%
  filter(Entity != "World") %>%
  filter(co2 != 0)
d
pp <- streamgraph(d, key="Entity",
                  order = "asis",
                  value="co2", 
                  date="Year",
                  offset="zero",
                  sort="co2"
                  ) %>%
  sg_axis_y(tick_format = "e")  %>%
  sg_legend(show=TRUE, label="Country: ") %>%
  sg_title("CO2 Emissions by Years and Countries, 1950-2017")

pp


```
### 3.3 Commulative Happinness Trends in Single Country 

```{r}
pp <- streamgraph(df_2005, key = "Entity",
                  # order = "reverse",
                  value="Life.Ladder", 
                  date="year"
                  # offset="zero",
                  ) %>%
  sg_fill_brewer("Blues") %>%
  sg_legend(show=TRUE, label="Country: ") %>%
  sg_title("Happiness Index by Years and Countries, 1950-2017") 
  

pp
# g <- ggplot(data = df_2005, aes(x = year, y = Life.Ladder, group = Entity)) + 
#     geom_line() + 
#     geom_line(color="#69b3a2") +
#     theme_ipsum()
#     
# g <- g + labs(title = "Contries Hapinness by Years")
# g <- ggplotly(g)
# g

```

### 3.4 Yearly in CO2 Emissions for Top 10 Polluting Countries

```{r}
top_emiss_2017 <- df_emiss %>%
  filter(Year == 2017) %>%
  filter(Code != '') %>%
  filter(Entity != "World") %>%
  top_n( 10, co2) %>%
  arrange(desc(co2)) %>%
  head(10)
  
top_countries = top_emiss_2017$Entity
```

```{r}
g <- ggplot(data = df_emiss[df_emiss$Entity %in% top_countries, ] 
                          , aes(x = Year, y = co2, group = Entity)) + 
    geom_line() +
    labs(title = "Top 10 Countries by CO2 Emissions") + 
    geom_line(aes(col = Entity)) +
    theme_ipsum()
g <- ggplotly(g)
g
```


```{r}
g <- ggplot(data = df_all[df_all$Entity %in% top_countries, ] 
                          , aes(x = Year, y = ratio, group = Entity)) + 
    geom_line(aes(col = Entity)) + 
    labs(title = "Top 10 Countries by CO2 Emissions to Population ratio") +
    theme_ipsum()

g <- ggplotly(g)
g

```
### Boxplot for top 10 Countries 

```{r}
# TODO add boxplot for top 10 diff by year
d <- df_all[df_all$Entity %in% top_countries, ] 
d
fig <- plot_ly(d,
               x = ~Entity,
               y = ~Diff,
               type = "box",
               color = ~Year)

fig <- fig %>%  layout(
  title = "Difference in Emission for Top 10 Polluting Countries",
  boxmode = "group")
fig
  
```
### Top 10 Happiest countries Emmissions by Years

```{r}

top_10_happiness <- df_merge %>%
  filter(rank(desc(Life.Ladder))<=10)


g <- ggplot(data = df_emiss[df_emiss$Entity %in% top_10_happiness$Entity, ] 
                          , aes(x = Year, y = co2, group = Entity)) + 
    geom_line() +
    labs(title = "Top 10 happiness Countries by CO2 Emissions") + 
    theme_ipsum() +
    geom_line(aes(col = Entity))
g <- ggplotly(g)
g
```



## Modelling

In this section we are going to suggest two types of correlation based on our exploration of the data:

1. There is correlation between the total trend of emissions and the current happiness index.
2. There is correlation between current emission normalized by the population to the current happiness index.

```{r}
g <- ggplot(data = df_merge, aes(x = avg.ratio, y = Life.Ladder)) + 
    geom_point(aes(colour = Entity)) + 
    labs(title = "Average Ratio vs Happiness Index") +
    theme_ipsum() 
g <- ggplotly(g)
g
```

In the following plot we have removed china and India because they are sort of "outliers" in a sense that they have displayed much larger grow in emissions compared to other countries, therefore making it very hard to displyon the same plot. 

```{r}
df <- subset(df_merge, Entity != "China" & Entity != "India" )

g <- ggplot(data = df, aes(x = Life.Ladder , y = sum.dif)) + 
    geom_point(aes(colour = Entity)) + 
    labs(title = "Sum of Emissions Difference vs 2017 Happiness Index (China, India ex.)") +
    theme_ipsum()
g <- ggplotly(g)
g
```

 # Part 4: Modeling
 
 We are now going to introduce the linear model:
```{r}
linearmod  <- lm(Life.Ladder ~ avg.ratio, data=df_merge)

summary(linearmod)

confint(linearmod, level=0.95)
plot(fitted(linearmod),residuals(linearmod))


g <- ggplot(df_merge, aes(x = avg.ratio, y = Life.Ladder)) +
  geom_point(aes(colour = Entity)) + 
  theme_bw() +
  stat_smooth(method = "lm") + 
  labs(title = "Linear Modelling Happiness Index and Average Ratio")
g <- ggplotly(g)
g


```

```{r}

numeric_tidy <- df_merge[-1]

corr_data <- cor(numeric_tidy)

g <- ggcorrplot(corr_data, hc.order = TRUE, type = "lower",
   outline.col = "white") +
  labs(title = "Correlation Matrix") +
  theme_ipsum()
   # colors = c("darkolivegreen2", "white", "darkolivegreen"))
g <- ggplotly(g)
g

```
```{r}
# TODO The smooth function doesn't work!!!!

gmod  <- lm(Life.Ladder ~ log(avg.ratio), data=df_merge)

summary(gmod)
#plot x vs. y
g <- ggplot( df_merge, aes(x=log(avg.ratio), y=Life.Ladder)) +
    geom_point(aes(colour = Life.Ladder)) +
    theme_ipsum() +
    labs(title = "Another model") +
    geom_smooth(method = "glm",  formula = y~log(x),
    method.args = list(family = "binomial"), 
    se = FALSE)
g <- ggplotly(g)
g

```

# Add text what we need to understand and explain
# Conclusions: present 

```{r}
#multiple regression
multi <- lm(Life.Ladder ~ avg.ratio + sum.dif + avg.co2 , data=df_merge)
summary(multi)

```
Now let’s say we want to compare the differences between the top 10 happiness countries avg.ratio is higher then other countries. In this case, we have two unrelated (i.e., independent or unpaired) groups of samples. Therefore, it’s possible to use an independent t-test to evaluate whether the means are different. 

mA - Weighted average of the 10 most happiest countries.
mB - Weighted average of the rest of the countries.

Our research questions:


## Is the mean of top 10 happiness (mA) is greater than the mean of other countries (mB)? // Yarden - I fixed the sentence

Is the weighted average of the 10 happiest countries (mA) greater than the weighted average of the other countries (mB)?

H0:mA≥mB  - The null hypothesis

Ha:mA<mB (less) - The alternative hypothesis


```{r}
# Create a data frame
T_data <- df_merge %>%
  select(Entity, avg.ratio,) %>%
  mutate(group = ifelse(Entity %in% top_10_happiness$Entity, "Top 10 Contries", "Other Countries")) 
  

## TODO Yarden - Need to add to T_data a column of Life.Ladder

## TODO make a graph that shows the difference of Life Ladder and avg.ratio by groups "top_10" and "others". 

g <- ggboxplot(T_data, x = "group", y = "avg.ratio", 
          color = "group", palette = c("#00AFBB", "#E7B800"),
        ylab = "avg. ratio", xlab = "Group") +
  theme_ipsum() +
  labs(title = "Average Ratio Compared - Top 10 Countries vs Other Countries")

g <- ggplotly(g)
g


```

```{r}
#Do the two group have the same variances?
#We’ll use F-test to test for homogeneity in variances. 

res.ftest <- var.test(avg.ratio ~ group, data = T_data)
res.ftest

```
The p-value of the F-test is p = 0.03252. It is lower than the significance level alpha = 0.05. In conclusion, there is  a significant difference between the variances of the two data sets. Therefore, we used the T-test and assumed that the variances are not equal - according to case 1 of hypothesis testing.

```{r}

# Compute t-test
res <- t.test(avg.ratio ~ group, data = T_data, var.equal = FALSE, alternative = "less")
res

```